```python
# Loading in packages
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import HistGradientBoostingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn import metrics

# Loading in data
url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv"
dwellings_ml = pd.read_csv(url)
h_subset = dwellings_ml.filter(
    ['livearea', 'finbsmnt', 'basement', 'yearbuilt', 'nocars',
     'numbdrm', 'numbaths', 'before1980', 'stories', 'yrbuilt',
     'sprice', 'floorlvl', 'condition_Excel', 'condition_VGood',
     'condition_AVG', 'condition_Good'])

# Older homes are more likely to not have anything left unfinished
h_subset['unfinishedbasement'] = h_subset['basement'] - h_subset['finbsmnt']
# Older homes will have a lower value per square foot
h_subset['pricepersqft'] = h_subset['sprice'] / h_subset['livearea']
# Older homes would be more likely to have sustained more wear and tear
h_subset['condition'] = (h_subset['condition_Excel'] * 10
                         + h_subset['condition_VGood'] * 8
                         + h_subset['condition_Good'] * 6
                         + h_subset['condition_AVG'] * 4)

X = h_subset[['livearea', 'finbsmnt', 'basement', 'nocars', 'numbdrm',
              'numbaths', 'stories', 'unfinishedbasement', 'pricepersqft',
              'condition']]
y = h_subset['before1980']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.32, random_state=42)

xgb_model = xgb.XGBClassifier(n_estimators=250, max_depth=6, learning_rate=0.1)
xgb_model.fit(X_train, y_train)

# Get feature importances from the trained XGBoost model
feature_importances = xgb_model.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the feature importance table
print(feature_importance_df)

# Create a heatmap of feature importances (index on 'Feature' so the
# y-axis shows feature names rather than row numbers)
plt.figure(figsize=(10, 6))
sns.heatmap(feature_importance_df.set_index('Feature')[['Importance']],
            annot=True, cmap='viridis', fmt=".2f")
plt.title('Feature Importances (XGBoost)')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()
```
The most important feature for determining whether a house was built before 1980 is the number of stories; it had by far the greatest effect on the model's accuracy. Another important feature had to be engineered: the condition of the house, since the older a house is, the less likely it is to be in excellent or very good condition. In the end, it was a combination of selecting the best features and engineering a couple of additional ones to train on that allowed me to achieve over 90% accuracy.
QUESTION|TASK 1
Create 2-3 charts that evaluate potential relationships between the home variables and before1980. Explain what you learn from the charts that could help a machine learning algorithm.
This chart demonstrates the relationship between the number of cars and the number of bedrooms. The data seems to indicate that no-car and single-car garages are a good sign that a house was built before 1980. With regard to 2-car garages, only very small houses (1-2 bedrooms) would be expected to have one if built after 1980; any instance where a house has a 2-car garage and 3 or more bedrooms is likely to have been built before 1980.
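The pattern described above can also be checked directly as a pivot table of the share of pre-1980 homes for each garage-size/bedroom combination. This is a minimal sketch: the `homes` frame below is a tiny synthetic stand-in for `dwellings_ml`, so the numbers are illustrative only, not the real data.

```python
import pandas as pd

# Tiny synthetic stand-in for dwellings_ml; the real values come from the CSV
homes = pd.DataFrame({
    'nocars':     [0, 1, 2, 2, 2, 1, 0, 2],
    'numbdrm':    [2, 3, 2, 4, 3, 2, 1, 1],
    'before1980': [1, 1, 0, 1, 1, 1, 1, 0],
})

# Share of homes built before 1980 for each garage-size / bedroom combination
share = homes.pivot_table(index='nocars', columns='numbdrm',
                          values='before1980', aggfunc='mean')
print(share)
```

On the real data, cells near 1.0 (e.g. 2-car garages with many bedrooms) would mark combinations that strongly predict a pre-1980 build.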
This chart gives me some ideas as to which fields are more indicative of a house being built pre-1980. Specifically, it looks like finbsmnt, basement, nocars, and numbdrm are some of the better fields to use in training my model.
Read and format data
```python
# Loading in packages
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.naive_bayes import GaussianNB

# Loading in data
url = "https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv"
dwellings_ml = pd.read_csv(url)
h_subset = dwellings_ml.filter(
    ['livearea', 'finbsmnt', 'basement', 'yearbuilt', 'nocars',
     'numbdrm', 'numbaths', 'before1980', 'stories', 'yrbuilt',
     'sprice', 'floorlvl', 'condition_Excel', 'condition_VGood',
     'condition_AVG', 'condition_Good'])

# Correlation heatmap of the candidate fields
corr = h_subset.corr()
px.imshow(corr, text_auto=True)
```
QUESTION|TASK 2
Build a classification model labeling houses as being built “before 1980” or “during or after 1980”. Your goal is to reach or exceed 90% accuracy.
I spent a large amount of time tuning the model's parameters but was only able to achieve an accuracy of about 88%. When I was in one of the tutoring labs this last week, the tutor said something that stuck with me: one of the things he does is take a moment and think about what he knows about the data. So I asked myself the question, "What do I know about pre-1980 houses versus more recently built ones?" A few things came to mind.
First, older homes are less likely to have any portion of their basement unfinished. Over the years, it is likely that any unfinished areas would have been completed.
Second, older homes generally have a lower value per square foot.
Third, an older home will have sustained more wear and tear over the years. Newer homes are more likely to be in excellent condition, whereas older ones are more likely to be only average or worse.
With these three things in mind I created some new variables that the model could make use of.
Read and format data
```python
# Include and execute your code here
# Tuning the data

# Older homes are more likely to not have anything left unfinished
h_subset['unfinishedbasement'] = h_subset['basement'] - h_subset['finbsmnt']
# Older homes will have a lower value per square foot
h_subset['pricepersqft'] = h_subset['sprice'] / h_subset['livearea']
# Older homes would be more likely to have sustained more wear and tear
h_subset['condition'] = (h_subset['condition_Excel'] * 10
                         + h_subset['condition_VGood'] * 8
                         + h_subset['condition_Good'] * 6
                         + h_subset['condition_AVG'] * 4)
```
Once I made these adjustments, all three models that I had previously attempted were able to achieve accuracy levels over 90%.
Justify your classification model by discussing the most important features selected by your model.
As shown by the following chart, the most important feature by far is the number of stories; it is several times more effective than any of the other features. My initial suspicion that the condition of the house might be a good indicator of age proved correct, as that engineered feature ended up being second most important. The number of baths a home has also turned out to be an important feature for training the model.
Show the code
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Get feature importances from the trained XGBoost model
feature_importances = xgb_model.feature_importances_

# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance in descending order
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the feature importance table
print(feature_importance_df)

# Create a heatmap of feature importances (index on 'Feature' so the
# y-axis shows feature names rather than row numbers)
plt.figure(figsize=(10, 6))
sns.heatmap(feature_importance_df.set_index('Feature')[['Importance']],
            annot=True, cmap='viridis', fmt=".2f")
plt.title('Feature Importances (XGBoost)')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()
```
Describe the quality of your classification model using 2-3 different evaluation metrics.
Of the three classification models I used, I believe Hist GBM ended up being the best choice. All three returned very similar numbers, but its accuracy of 0.9014 made it the best option. Its F1 score was also slightly higher than the other models', while the remaining metrics (precision, recall, R², and root mean squared error) were almost identical across the three.